Exploratory Data Analysis: Wisconsin Diagnostic Breast Cancer (WDBC)¶

1.1 Introduction¶

This report analyzes the Wisconsin Diagnostic Breast Cancer (WDBC) dataset to identify key features distinguishing malignant from benign tumors. The data features were computed from digitized images of fine needle aspirates (FNA) of breast masses, describing the characteristics of the cell nuclei present in the image (Wolberg et al., 1995).

1.2 Data Acquisition¶

The raw data was retrieved directly from the UCI Machine Learning Repository to ensure reproducibility. The dataset consists of 569 instances with 30 real-valued input features and one binary target variable (Diagnosis).

In [1]:
# Imports (only those necessary for EDA)
import pandas as pd
import numpy as np

import altair_ally as aly
import altair as alt
alt.data_transformers.enable('vegafusion')

from ucimlrepo import fetch_ucirepo
In [2]:
# import the data
# Code from https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
# Need ucimlrepo package to load the data
raw_data = fetch_ucirepo(id=17)

raw_X = raw_data.data.features
raw_y = raw_data.data.targets

raw_df = pd.concat([raw_X, raw_y], axis=1)
raw_df.to_csv("../data/raw/breast_cancer_raw.csv", index=False)

2. Data Cleaning, Schema Mapping, and Data Validation¶

The raw dataset lacks semantic column headers. To facilitate analysis, we implemented a schema mapping strategy based on the wdbc.names metadata. The 30 features represent ten distinct cell nucleus characteristics (e.g., Radius, Texture) computed in three statistical forms.

We applied the following suffix mapping transformation:

  • Mean Value: Suffix 1 -> _mean
  • Standard Error: Suffix 2 -> _se
  • Worst (Max) Value: Suffix 3 -> _max

This step ensures all features are semantically interpretable for the subsequent EDA.

To ensure the dataset is clean, consistent, and ready for modeling, we validated the data by implementing the following checks:

  • Correct Data File Format: The cleaned dataset was exported as a standard CSV (breast_cancer_cleaned.csv) with UTF-8 encoding. It loaded in pandas without errors, confirming proper file format and readability.

  • Correct Column Names: All 30 feature columns follow the expected naming convention (_mean, _se, _max). The Diagnosis column contains only "Benign" and "Malignant" values, and the total number of columns is 31, as expected.

  • No Empty Observations: All rows contain complete observations. There are no fully empty rows, and no partial missing values were detected.

  • Missingness Within Expected Threshold: A threshold of 5% missingness per column was applied. No column exceeded this limit, ensuring all features are sufficiently complete for reliable modeling.

By combining schema mapping with these validation checks, the dataset is fully consistent, correctly formatted, and reproducible, providing a robust foundation for downstream modeling and analysis.

In [3]:
# Attempt loading the file to ensure it’s a valid CSV [Ensuring correct Data File Format]
try:
    df = pd.read_csv('../data/processed/breast_cancer_cleaned.csv')
    print("File loaded successfully. Format OK.")
except Exception as e:
    raise AssertionError(f"File format error: {e}")

# Ensure no unnamed index column is within the data
assert not any(df.columns.str.contains("Unnamed")), \
    "Error: Unnamed index column detected!"

# Clean the column names based on description
clean_columns = []
for col in raw_X.columns:
    if col.endswith('1'):
        clean_name = col[:-1] + '_mean'
    elif col.endswith('2'):
        clean_name = col[:-1] + '_se'
    elif col.endswith('3'):
        clean_name = col[:-1] + '_max'
    else:
        clean_name = col
    
    clean_columns.append(clean_name)
X = raw_X.copy()
X.columns = clean_columns  # rename on a copy so raw_X is not mutated and the cell stays re-runnable

# Clean the target column
y = raw_y.copy()
y['Diagnosis'] = y['Diagnosis'].map({'M': 'Malignant', 'B': 'Benign'})
clean_df = pd.concat([X, y], axis=1)

# Must be 31 columns: 30 features + Diagnosis
assert clean_df.shape[1] == 31, f"Unexpected number of columns: {clean_df.shape[1]}"

# Check naming pattern
allowed_suffixes = ("_mean", "_se", "_max")

feature_cols = [c for c in clean_df.columns if c != "Diagnosis"]

# All feature columns must end with one of the suffixes
for col in feature_cols:
    assert col.endswith(allowed_suffixes), f"Invalid column name: {col}"

# Diagnosis must contain valid labels
assert clean_df["Diagnosis"].isin(["Benign", "Malignant"]).all(), "Invalid Diagnosis values detected"

print("Column names OK")

empty_rows = df.isna().all(axis=1).sum()
assert empty_rows == 0, f"Found {empty_rows} completely empty rows!"
print("No empty observations.")

partial_missing = df.isna().any(axis=1).sum()
print(f"Rows with any missing values: {partial_missing}")

threshold = 0.05
missing_ratio = df.isna().mean()
too_high = missing_ratio[missing_ratio > threshold]
assert too_high.empty, f"Columns exceeding missingness threshold:\n{too_high}"
print("Missingness below threshold.")

# Export the cleaned data
clean_df.to_csv('../data/processed/breast_cancer_cleaned.csv', index=False)

clean_df
File loaded successfully. Format OK.
Column names OK
No empty observations.
Rows with any missing values: 0
Missingness below threshold.
Out[3]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean ... texture_max perimeter_max area_max smoothness_max compactness_max concavity_max concave_points_max symmetry_max fractal_dimension_max Diagnosis
0 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.30010 0.14710 0.2419 0.07871 ... 17.33 184.60 2019.0 0.16220 0.66560 0.7119 0.2654 0.4601 0.11890 Malignant
1 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.08690 0.07017 0.1812 0.05667 ... 23.41 158.80 1956.0 0.12380 0.18660 0.2416 0.1860 0.2750 0.08902 Malignant
2 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.19740 0.12790 0.2069 0.05999 ... 25.53 152.50 1709.0 0.14440 0.42450 0.4504 0.2430 0.3613 0.08758 Malignant
3 11.42 20.38 77.58 386.1 0.14250 0.28390 0.24140 0.10520 0.2597 0.09744 ... 26.50 98.87 567.7 0.20980 0.86630 0.6869 0.2575 0.6638 0.17300 Malignant
4 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.19800 0.10430 0.1809 0.05883 ... 16.67 152.20 1575.0 0.13740 0.20500 0.4000 0.1625 0.2364 0.07678 Malignant
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
564 21.56 22.39 142.00 1479.0 0.11100 0.11590 0.24390 0.13890 0.1726 0.05623 ... 26.40 166.10 2027.0 0.14100 0.21130 0.4107 0.2216 0.2060 0.07115 Malignant
565 20.13 28.25 131.20 1261.0 0.09780 0.10340 0.14400 0.09791 0.1752 0.05533 ... 38.25 155.00 1731.0 0.11660 0.19220 0.3215 0.1628 0.2572 0.06637 Malignant
566 16.60 28.08 108.30 858.1 0.08455 0.10230 0.09251 0.05302 0.1590 0.05648 ... 34.12 126.70 1124.0 0.11390 0.30940 0.3403 0.1418 0.2218 0.07820 Malignant
567 20.60 29.33 140.10 1265.0 0.11780 0.27700 0.35140 0.15200 0.2397 0.07016 ... 39.42 184.60 1821.0 0.16500 0.86810 0.9387 0.2650 0.4087 0.12400 Malignant
568 7.76 24.54 47.92 181.0 0.05263 0.04362 0.00000 0.00000 0.1587 0.05884 ... 30.37 59.16 268.6 0.08996 0.06444 0.0000 0.0000 0.2871 0.07039 Benign

569 rows × 31 columns

In [4]:
# Data Validation (5-8)
# 5. Correct data types in each column
feature_cols = [c for c in clean_df.columns if c != 'Diagnosis']
assert clean_df[feature_cols].select_dtypes(include=['number']).shape[1] == len(feature_cols), \
    "Error: Non-numeric features detected."
# 6. No duplicate observations
n_duplicates = clean_df.duplicated().sum()
assert n_duplicates == 0, \
    f"Error: {n_duplicates} duplicate observations detected."
In [5]:
# 7. No outlier or anomalous values
suffixes = ['_mean', '_se', '_max']
charts = []

for suffix in suffixes:
    cols = [c for c in clean_df.columns if c.endswith(suffix) and c != 'Diagnosis']
    
    # The range of the 'Area' features (values ~1000+) far exceeds that of other features (values < 1).
    # We use 'symlog' (Symmetric Log) to handle potential 0 values in features like 'Concavity'.
    
    chart = alt.Chart(clean_df).mark_boxplot(extent=1.5).encode(
        x=alt.X('value:Q', title='Value (Symlog)', scale=alt.Scale(type='symlog')),
        y=alt.Y('variable:N', title='Feature'),
        color=alt.Color('Diagnosis:N', title='Diagnosis'),
        tooltip=['variable:N', 'value:Q', 'Diagnosis']
    ).transform_fold(
        cols,
        as_=['variable', 'value']
    ).properties(
        title=f'Distribution of {suffix} Features',
        width=400,
        height=300
    )
    
    charts.append(chart)

display(alt.vconcat(*charts).resolve_scale(x='independent'))

Outlier Analysis & Scaling Strategy¶

Due to the massive disparity in magnitude between features (e.g., Area > 2000 vs. Smoothness < 0.2), a Symmetric Log (Symlog) scale was applied to the visualizations. This effectively mitigates the compressing effect of the Area feature, allowing the distribution and spread of smaller-scale variables to be clearly observed without losing the information from larger values.

Post-scaling inspection reveals numerous outliers (points beyond whiskers), particularly in Malignant samples (Orange).

  • Significance: These are not data errors. In the context of breast cancer, extreme values in features like Area, Concavity, and Perimeter are characteristic of malignant tumor growth.
  • Conclusion: These points represent high-priority biological signals essential for classification.

Preprocessing Recommendation

  • Action: Do not drop these outliers, as removing them would discard critical diagnostic information.
  • Strategy: To handle the skewness and scale differences during modeling:
    1. Apply Log Transformation (np.log1p) to right-skewed features (Area, Perimeter) to normalize distributions.
    2. Apply Standard Scaling (StandardScaler) to all features to ensure the model treats all dimensions with equal weight.
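The two-step strategy above can be sketched on a toy frame (illustrative values only; in practice it would be applied to the cleaned feature columns):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the skewed, differently scaled features above
toy = pd.DataFrame({
    "area_mean": [386.1, 1001.0, 2501.0],
    "smoothness_mean": [0.0526, 0.1184, 0.1634],
})

# Step 1: log1p compresses the long right tail of skewed features
toy["area_mean"] = np.log1p(toy["area_mean"])

# Step 2: standardize so every feature contributes on the same scale
scaled = StandardScaler().fit_transform(toy)
```

After scaling, each column has mean 0 and unit variance, so the ~10,000x magnitude gap between Area and Smoothness disappears.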
In [6]:
# 8. Correct category levels (i.e., no string mismatches or single values)
target_counts = clean_df['Diagnosis'].value_counts(dropna=False)
## No Single category
assert len(target_counts) > 1, \
    f"Error: Single category detected. Only found: {target_counts.index.tolist()}"
## At least 2 samples per category
min_samples = target_counts.min()
assert min_samples > 1, \
    f"Error: Found a category with a single observation! Min samples: {min_samples}"
## No unexpected category labels
expected_classes = {'Malignant', 'Benign'}
actual_classes = set(target_counts.index)
assert actual_classes == expected_classes, \
    f"Error: Unexpected category labels found! Found: {actual_classes}, Expected: {expected_classes}"
  9. Target/response variable follows expected distribution
  • We validate that the target variable Diagnosis is not severely imbalanced. If one class is much rarer than the other, this can hurt model performance and may require special handling (for example, resampling or adjusting evaluation metrics).
  10. No anomalous correlations between target and features
  • We check how strongly each feature is associated with Diagnosis. Extremely high predictive power for a single feature can indicate data leakage or unexpected dependencies that should be investigated.
  11. No anomalous correlations between features
  • We examine pairwise correlations between features. If many feature pairs are highly correlated, this suggests redundancy or multicollinearity, which may require feature selection or dimensionality reduction.
In [7]:
from deepchecks.tabular import Dataset
from deepchecks.tabular.checks import ClassImbalance, FeatureLabelCorrelation
from deepchecks.tabular.checks.data_integrity import FeatureFeatureCorrelation

bc_dataset = Dataset(
    clean_df,
    label='Diagnosis'
)

# 9. Target/response variable follows expected distribution
class_imbalance_check = ClassImbalance().add_condition_class_ratio_less_than(
    class_imbalance_ratio_th=0.2  # flag if minority / majority < 0.2
)

class_imbalance_result = class_imbalance_check.run(bc_dataset)

class_imbalance_result
VBox(children=(HTML(value='<h4><b>Class Imbalance</b></h4>'), HTML(value='<p>Check if a dataset is imbalanced …
In [8]:
# 10. Feature–target correlations
flc_check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(
    threshold=0.8  # flag features that are *too* predictive of the label
)

flc_result = flc_check.run(bc_dataset)
flc_result
VBox(children=(HTML(value='<h4><b>Feature Label Correlation</b></h4>'), HTML(value='<p>Return the PPS (Predict…
In [9]:
# 11. Feature–feature correlations
ffc_check = FeatureFeatureCorrelation().add_condition_max_number_of_pairs_above_threshold(
    0.95,
    10 
)

ffc_result = ffc_check.run(bc_dataset)
ffc_result
VBox(children=(HTML(value='<h4><b>Feature-Feature Correlation</b></h4>'), HTML(value='<p>    Checks for pairwi…

The FeatureFeatureCorrelation check fails the condition we set. Many feature pairs in our data are very highly correlated, for example radius_mean, perimeter_mean, and area_mean, as well as their corresponding _max and _se versions. This pattern is expected for this dataset: these variables all describe related geometric properties of the tumor, so the strong correlations are not a data quality error but a sign of redundancy and multicollinearity.
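The redundant pairs flagged above can also be listed directly from the correlation matrix. The sketch below uses a hypothetical miniature frame (perimeter is roughly 2πr, so the pair is near-perfectly correlated; texture is unrelated), not the actual WDBC values:

```python
import numpy as np
import pandas as pd

# Miniature illustration: perimeter is proportional to radius (~2*pi*r),
# so the pair is near-perfectly correlated; texture is unrelated noise
df = pd.DataFrame({
    "radius_mean":    [10.0, 12.0, 14.0, 20.0],
    "perimeter_mean": [62.8, 75.4, 88.0, 125.7],
    "texture_mean":   [14.0, 9.0, 21.0, 11.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
high_pairs = pairs[pairs > 0.95]
print(high_pairs)
```

Running the same logic on clean_df would enumerate the radius/perimeter/area family flagged by deepchecks.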

3. Data Profiling: Structure and Statistics¶

Purpose:

  • df.info(): Used to verify data integrity by checking for null values and ensuring all feature columns are of float64 type.
  • df.describe(): Used to examine the central tendency and spread of numeric features. This highlights differences in magnitude (scales) across variables.

Observation: The dataset is complete (no missing values). However, describe() reveals massive scale disparities (e.g., area_mean ranges up to 2500, while smoothness_mean is < 0.1), confirming the necessity for Feature Scaling (Standardization) before modeling.

In [10]:
clean_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 31 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   radius_mean             569 non-null    float64
 1   texture_mean            569 non-null    float64
 2   perimeter_mean          569 non-null    float64
 3   area_mean               569 non-null    float64
 4   smoothness_mean         569 non-null    float64
 5   compactness_mean        569 non-null    float64
 6   concavity_mean          569 non-null    float64
 7   concave_points_mean     569 non-null    float64
 8   symmetry_mean           569 non-null    float64
 9   fractal_dimension_mean  569 non-null    float64
 10  radius_se               569 non-null    float64
 11  texture_se              569 non-null    float64
 12  perimeter_se            569 non-null    float64
 13  area_se                 569 non-null    float64
 14  smoothness_se           569 non-null    float64
 15  compactness_se          569 non-null    float64
 16  concavity_se            569 non-null    float64
 17  concave_points_se       569 non-null    float64
 18  symmetry_se             569 non-null    float64
 19  fractal_dimension_se    569 non-null    float64
 20  radius_max              569 non-null    float64
 21  texture_max             569 non-null    float64
 22  perimeter_max           569 non-null    float64
 23  area_max                569 non-null    float64
 24  smoothness_max          569 non-null    float64
 25  compactness_max         569 non-null    float64
 26  concavity_max           569 non-null    float64
 27  concave_points_max      569 non-null    float64
 28  symmetry_max            569 non-null    float64
 29  fractal_dimension_max   569 non-null    float64
 30  Diagnosis               569 non-null    object 
dtypes: float64(30), object(1)
memory usage: 137.9+ KB
In [11]:
clean_df.describe()
Out[11]:
radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concave_points_mean symmetry_mean fractal_dimension_mean ... radius_max texture_max perimeter_max area_max smoothness_max compactness_max concavity_max concave_points_max symmetry_max fractal_dimension_max
count 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 ... 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000 569.000000
mean 14.127292 19.289649 91.969033 654.889104 0.096360 0.104341 0.088799 0.048919 0.181162 0.062798 ... 16.269190 25.677223 107.261213 880.583128 0.132369 0.254265 0.272188 0.114606 0.290076 0.083946
std 3.524049 4.301036 24.298981 351.914129 0.014064 0.052813 0.079720 0.038803 0.027414 0.007060 ... 4.833242 6.146258 33.602542 569.356993 0.022832 0.157336 0.208624 0.065732 0.061867 0.018061
min 6.981000 9.710000 43.790000 143.500000 0.052630 0.019380 0.000000 0.000000 0.106000 0.049960 ... 7.930000 12.020000 50.410000 185.200000 0.071170 0.027290 0.000000 0.000000 0.156500 0.055040
25% 11.700000 16.170000 75.170000 420.300000 0.086370 0.064920 0.029560 0.020310 0.161900 0.057700 ... 13.010000 21.080000 84.110000 515.300000 0.116600 0.147200 0.114500 0.064930 0.250400 0.071460
50% 13.370000 18.840000 86.240000 551.100000 0.095870 0.092630 0.061540 0.033500 0.179200 0.061540 ... 14.970000 25.410000 97.660000 686.500000 0.131300 0.211900 0.226700 0.099930 0.282200 0.080040
75% 15.780000 21.800000 104.100000 782.700000 0.105300 0.130400 0.130700 0.074000 0.195700 0.066120 ... 18.790000 29.720000 125.400000 1084.000000 0.146000 0.339100 0.382900 0.161400 0.317900 0.092080
max 28.110000 39.280000 188.500000 2501.000000 0.163400 0.345400 0.426800 0.201200 0.304000 0.097440 ... 36.040000 49.540000 251.200000 4254.000000 0.222600 1.058000 1.252000 0.291000 0.663800 0.207500

8 rows × 30 columns

4. Correlation Analysis: Pearson vs. Spearman¶

Method:

  • Pearson Correlation: Measures linear relationships.
  • Spearman Correlation: Measures monotonic rank relationships (non-linear). Comparing both helps identify if relationships are strictly linear or just trending in the same direction.

Purpose: To detect Multicollinearity—redundant features that increase model complexity without adding information.

Results: Both metrics show near-perfect correlation ($>0.95$) between Radius, Perimeter, and Area. This confirms these features are geometrically redundant. We should retain only one (e.g., Radius) and drop the others to improve model stability.
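The difference between the two metrics is easiest to see on a toy monotonic-but-nonlinear relationship (illustrative values, not WDBC data):

```python
import pandas as pd

# y = x**2 on positive x is monotonic but not linear:
# Spearman (rank-based) is exactly 1, Pearson (linear) falls short
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 2

pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")
```

When both metrics are near 1 for a feature pair, as with Radius and Area here, the relationship is effectively monotonic and the pair is redundant.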

In [12]:
# Multicollinearity

corr_chart = aly.corr(clean_df)

corr_chart.save('../results/images/corr_chart.png')
corr_chart.save('../results/images/corr_chart.svg')

corr_chart
Out[12]:

5. Pairwise Separability Analysis¶

Purpose: To visualize 2D decision boundaries. We look for feature combinations where the Benign (Blue) and Malignant (Orange) clusters are clearly distinct with minimal overlap.

Results:

  • High Separability: Features related to size (radius_mean) and shape complexity (concavity_mean) separate the classes well.
  • Non-linear patterns: The curved relationship between area and radius is clearly visible, reinforcing the geometric redundancy found in the correlation analysis.
In [13]:
# Only include the _mean features, as they carry most of the information
cols_mean = [c for c in clean_df.columns if '_mean' in c] + ['Diagnosis']
pair_chart = aly.pair(clean_df[cols_mean], color='Diagnosis:N')

pair_chart.save('../results/images/pair_chart.png')
pair_chart.save('../results/images/pair_chart.svg')

pair_chart
Out[13]:

6. Distribution Analysis¶

Purpose: To inspect the univariate "shape" of the data. We look for Skewness (asymmetry) and Outliers that could bias linear models.

Results:

  • Skewness: Features like area_se and concavity_mean are heavily right-skewed (long tail to the right). This indicates that Log Transformation is required to normalize these distributions.
  • Overlap: "Texture" and "Smoothness" show high overlap between classes, suggesting they are less informative on their own compared to "Size" features.
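Skewness can be quantified rather than eyeballed. A sketch on a hypothetical right-skewed sample (mimicking area_se's long upper tail; not actual WDBC values):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with a long upper tail, like area_se
area_se = pd.Series([10, 12, 15, 18, 20, 25, 30, 45, 90, 250])

# Positive sample skewness confirms a right tail; a common rule of
# thumb treats |skew| > 1 as highly skewed
skew_raw = area_se.skew()
skew_log = np.log1p(area_se).skew()
```

The log1p transform pulls in the upper tail, reducing the skewness and making the feature better behaved for linear models.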
In [14]:
dist_chart = aly.dist(clean_df, color='Diagnosis')

dist_chart.save('../results/images/dist_chart.png')
dist_chart.save('../results/images/dist_chart.svg')

dist_chart
Out[14]:

EDA Findings¶

  • Class Separation:
    • High Separability: Features related to size (radius, perimeter, area) and concavity (concave_points, concavity) show clear distinction between Benign and Malignant classes (Malignant samples generally have higher values).
    • Low Separability: Texture, Smoothness, and Fractal Dimension show significant overlap, indicating they are weaker individual predictors.
  • Distributions:
    • Skewness: "Area" and "Concavity" features (both _mean and _se) are heavily right-skewed.
    • Outliers: Visible in the upper tails of area_max and perimeter_se.
  • Correlations (Multicollinearity):
    • Severe Multicollinearity: radius, perimeter, and area are perfectly correlated ($R \approx 1$). This is expected geometrically but redundant for models.
    • concavity, concave_points, and compactness also exhibit very high positive correlation.

Preprocessing Recommendations¶

Based on the above, the following pipeline is suggested:

  1. Feature Selection / Drop:
    • Remove redundant features to reduce multicollinearity: keep radius and drop perimeter and area, since all three encode the same geometric information.
  2. Transformation:
    • Apply Log Transformation to skewed features (e.g., area, concavity) to normalize distributions.
  3. Scaling:
    • Features vary vastly in scale (e.g., area > 1000 vs. smoothness < 0.2). Use StandardScaler to standardize all features to unit variance.
  4. Imputation:
    • None needed (Data is clean).
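The recommendations above can be sketched as a single scikit-learn preprocessor. The feature lists below are illustrative (the full lists would include the corresponding _se and _max columns), and the imputation step is omitted since the data are complete:

```python
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Illustrative feature groups following the cleaned schema
skewed_feats = ["concavity_mean", "compactness_mean"]  # log-transform, then scale
plain_feats = ["radius_mean", "texture_mean"]          # scale only
drop_feats = ["perimeter_mean", "area_mean"]           # geometric duplicates of radius

preprocessor = make_column_transformer(
    (make_pipeline(FunctionTransformer(np.log1p), StandardScaler()), skewed_feats),
    (StandardScaler(), plain_feats),
    ("drop", drop_feats),
)
```

This transformer can then be placed in front of any classifier in a Pipeline, as done with the SVC in the next section.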

Onto Creating a Classification Model¶

In [15]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = clean_df.drop('Diagnosis', axis=1)
y = clean_df['Diagnosis']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
In [16]:
X_train.columns
Out[16]:
Index(['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'perimeter_se', 'area_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_max', 'texture_max', 'perimeter_max',
       'area_max', 'smoothness_max', 'compactness_max', 'concavity_max',
       'concave_points_max', 'symmetry_max', 'fractal_dimension_max'],
      dtype='object')
In [17]:
numeric_feats = ['radius_mean', 'texture_mean',
       'smoothness_mean', 'compactness_mean', 'concavity_mean',
       'concave_points_mean', 'symmetry_mean', 'fractal_dimension_mean',
       'radius_se', 'texture_se', 'smoothness_se',
       'compactness_se', 'concavity_se', 'concave_points_se', 'symmetry_se',
       'fractal_dimension_se', 'radius_max', 'texture_max', 
       'smoothness_max', 'compactness_max', 'concavity_max',
       'concave_points_max', 'symmetry_max', 'fractal_dimension_max']

drop_feats = [
    'perimeter_mean',
    'area_mean',
    'perimeter_se',
    'area_se',
    'texture_se',
    'smoothness_se',
    'symmetry_se',
    'perimeter_max',
    'area_max'
]
In [18]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import Pipeline

ct = make_column_transformer(    
    (StandardScaler(), numeric_feats), 
    ("drop", drop_feats)
)

pipe = Pipeline([
    ("preprocess", ct),
    ("svc", SVC())
])

param_grid = {
    "svc__gamma": [0.001, 0.01, 0.1, 1.0, 10, 100],
    "svc__C": [0.001, 0.01, 0.1, 1.0, 10, 100]
}

gs = GridSearchCV(
    estimator = pipe,
    param_grid = param_grid,
    cv = 15,
    n_jobs = -1,
    return_train_score = True
)

gs.fit(X_train, y_train)
Out[18]:
GridSearchCV(cv=15,
             estimator=Pipeline(steps=[('preprocess',
                                        ColumnTransformer(transformers=[('standardscaler',
                                                                         StandardScaler(),
                                                                         ['radius_mean',
                                                                          'texture_mean',
                                                                          'smoothness_mean',
                                                                          'compactness_mean',
                                                                          'concavity_mean',
                                                                          'concave_points_mean',
                                                                          'symmetry_mean',
                                                                          'fractal_dimension_mean',
                                                                          'radius_se',
                                                                          'texture_se',
                                                                          'smoothness_se',
                                                                          'compactness_se',
                                                                          'concavity_se',
                                                                          'con...
                                                                          'concavity_max',
                                                                          'concave_points_max',
                                                                          'symmetry_max',
                                                                          'fractal_dimension_max']),
                                                                        ('drop',
                                                                         'drop',
                                                                         ['perimeter_mean',
                                                                          'area_mean',
                                                                          'perimeter_se',
                                                                          'area_se',
                                                                          'texture_se',
                                                                          'smoothness_se',
                                                                          'symmetry_se',
                                                                          'perimeter_max',
                                                                          'area_max'])])),
                                       ('svc', SVC())]),
             n_jobs=-1,
             param_grid={'svc__C': [0.001, 0.01, 0.1, 1.0, 10, 100],
                         'svc__gamma': [0.001, 0.01, 0.1, 1.0, 10, 100]},
             return_train_score=True)
[Interactive HTML representation of the fitted GridSearchCV omitted. Best refit estimator: the ColumnTransformer + StandardScaler pipeline above feeding an SVC with C=10, gamma=0.01, kernel='rbf', selected by 15-fold cross-validation.]
In [19]:
# Summarize cross-validation results from the grid search
results = pd.DataFrame(gs.cv_results_)

# Top 10 hyperparameter combinations by mean cross-validation score
best_performing = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].sort_values(
    by='mean_test_score', ascending=False
).head(10)

# Cast C and gamma to strings so Altair treats them as discrete (nominal) axes
heatmap_data = results[['param_svc__C', 'param_svc__gamma', 'mean_test_score']].copy()
heatmap_data['C'] = heatmap_data['param_svc__C'].astype(str)
heatmap_data['gamma'] = heatmap_data['param_svc__gamma'].astype(str)

heatmap = alt.Chart(heatmap_data).mark_rect().encode(
    x=alt.X('gamma:N', title='gamma'),
    y=alt.Y('C:N', title='C'),
    color=alt.Color('mean_test_score:Q', scale=alt.Scale(scheme='viridis')),
    tooltip=['C', 'gamma', 'mean_test_score']
).properties(
    width=400,
    height=400,
    title='SVM GridSearchCV Mean Test Scores'
)
In [20]:
best_performing
Out[20]:
param_svc__C param_svc__gamma mean_test_score
25 10.0 0.010 0.969176
31 100.0 0.010 0.966667
30 100.0 0.001 0.960287
19 1.0 0.010 0.955986
24 10.0 0.001 0.955914
20 1.0 0.100 0.955914
26 10.0 0.100 0.953620
32 100.0 0.100 0.951470
18 1.0 0.001 0.931613
14 0.1 0.100 0.927455
In [21]:
heatmap.display()
In [22]:
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate the refit best estimator on the held-out test set
y_pred = gs.predict(X_test)

report = classification_report(y_test, y_pred, output_dict=True)
report_df = pd.DataFrame(report).transpose().drop('support', axis=1).drop(['macro avg', 'weighted avg'])
report_df
Out[22]:
precision recall f1-score
Benign 0.986486 1.000000 0.993197
Malignant 1.000000 0.975610 0.987654
accuracy 0.991228 0.991228 0.991228
In [23]:
cm = confusion_matrix(y_test, y_pred)
cm_df = pd.DataFrame(cm, index=gs.classes_, columns=gs.classes_)

# Reshape the confusion matrix to long format for Altair
cm_melted = cm_df.reset_index().melt(id_vars='index')
cm_melted.columns = ['Actual', 'Predicted', 'Count']

heatmap = alt.Chart(cm_melted).mark_rect().encode(
    x=alt.X('Predicted:N', title='Predicted'),
    y=alt.Y('Actual:N', title='Actual'),
    color=alt.Color('Count:Q', scale=alt.Scale(scheme='viridis'))
).properties(
    width=400,
    height=400,
    title='Confusion Matrix Heatmap'
)

# Overlay the raw counts on each cell of the heatmap
text = alt.Chart(cm_melted).mark_text(color='white').encode(
    x='Predicted:N',
    y='Actual:N',
    text='Count:Q'
)

heatmap + text
Out[23]:

Discussion:¶

Our model performed very well, achieving 99.1% accuracy on the held-out test set and correctly classifying all but one case. This result was broadly expected given the strong feature patterns observed during EDA, which suggested clear separation between benign and malignant tumors.

The main concern is the single false negative, where a malignant tumor was predicted as benign. Although rare, such an error carries significant clinical risk and shows that the model, while strong, is not yet reliable enough for real-world medical use.

These results suggest future work should focus on reducing false negatives, for example by adjusting class weights, using cost-sensitive training, or validating on external datasets to assess robustness.
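The class-weighting idea above can be sketched as follows. This is a minimal illustration on a synthetic dataset, not the WDBC pipeline itself: the weight of 5 on the positive ("malignant") class, the `C`/`gamma` values, and the synthetic data are all assumptions for demonstration. `class_weight` scales the SVC's misclassification penalty per class, which typically trades some extra false positives for fewer false negatives.

```python
# Hedged sketch: penalize missed positives ("malignant") more heavily via
# SVC's class_weight. Synthetic stand-in data; weights and C/gamma are
# illustrative assumptions, not the notebook's tuned configuration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.6, 0.4], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# A weight of 5 on class 1 makes each missed positive cost five times
# as much as a missed negative during training.
weighted = make_pipeline(
    StandardScaler(),
    SVC(C=10, gamma=0.01, class_weight={0: 1, 1: 5})
)
weighted.fit(X_tr, y_tr)

cm = confusion_matrix(y_te, weighted.predict(X_te))
false_negatives = cm[1, 0]  # positives ("malignant") predicted as negative
print(cm)
```

In practice one would re-run the grid search with `class_weight` included in the parameter grid and compare the resulting false-negative counts, rather than fixing the weight by hand.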

References¶

  1. Reitz, Kenneth. 2011. "Requests: HTTP for Humans." https://requests.readthedocs.io/en/master/.

  2. American Cancer Society. 2024. “Breast Cancer Facts & Figures.” https://www.cancer.org/cancer/types/breast-cancer.html.

  3. National Cancer Institute. 2024. “Breast Cancer Treatment (PDQ).” https://www.cancer.gov/types/breast/patient/breast-treatment-pdq.

  4. UCI Machine Learning Repository. 2017. “Breast Cancer Wisconsin (Diagnostic) Data Set.” https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).